REPORT LAYOUT: - Introduction - General Analysis and insights (try to find unique insights) - Analysis of factors affecting revenue - ML and prediction of select movies

Executive Summary

Project Framework: Definition and Methodology

In this report, we will explore the various factors that impact and influence the monetary success of a movie at the box office. Our investigation extends beyond mere fiscal considerations, encompassing a nuanced analysis of important factors such as the expertise of the cast and crew. By scrutinizing these diverse components, this report aims to provide a comprehensive understanding of the factors that defines a movie’s monetary success at the box office.

The data was obtained through the use of our own web scraping algorithm and covers the top 75 grossing movies over the past 25 years.

Temporal Analysis

Over time, the average revenue demonstrates a distinct upward trend, with a notable observation regarding the rate of growth in Foreign revenue compared to Domestic revenue. The surge in global revenue is primarily driven by the rapid expansion of foreign revenue, highlighting the escalating growth and acceptance of Western films in international markets.

The onset of the Covid-19 Pandemic significantly impacted the film industry, evident in the graph. Productions were halted, and theaters closed, leading to a substantial loss of earning potential. The lockdown measures globally disrupted filming schedules, postponed releases, and the closure of theaters eliminated a crucial avenue for revenue. This had a ripple effect across the industry, affecting filmmakers, actors, crew members, distributors, and exhibitors. The industry’s vulnerability to external shocks became apparent, prompting the need for innovative adaptations to navigate the challenges such as online releases.

The impact of the month of film release is a fascinating observation. Notably, films hitting the screens in May and June consistently outperform those released in other months. Utilizing an analysis of variance (ANOVA) shows a significant disparity in average revenue across different release months. Several factors contribute to this phenomenon:

  1. Summer Blockbuster Season: May and June fall within the traditional summer movie season in numerous regions. Studios strategically unveil high-budget blockbuster films during this period, targeting a broad audience. The warmer weather and school vacations further boost movie attendance.

  2. Strategic Release Patterns: The film industry acknowledges this pattern, leading to a clustering effect. Recognizing the advantageous months, more popular and anticipated films tend to be strategically released during May and June. This intentional scheduling capitalizes on the observed heightened audience engagement during these months.

  3. Genre Preferences: Certain movie genres, such as action, adventure, and fantasy, are often associated with May and June releases as seen in the graph. These genres tend to draw larger audiences and generate higher revenue, contributing to the observed pattern. (Median Revenue used to account for outliers)

    term

    df

    sumsq

    meansq

    statistic

    p.value

    Month

    11

    8.114459e+18

    7.376781e+17

    10.76122

    0

    Residuals

    1848

    1.266797e+20

    6.854964e+16

    NA

    NA

Another notable observation is the seasonality exhibited in the average revenue over a year. The seasonal strength, quantified by a value of 0.5817723, signifies a substantial recurring pattern within our data set.

This strong seasonality implies that there are recurring trends or patterns in revenue that manifest on an annual basis. It suggests that certain times of the year consistently contribute to increased or decreased revenue. Understanding and leveraging this seasonality can be pivotal for strategic decision-making in the realm of film releases.

In practical terms, this finding prompts a closer examination of the temporal distribution of revenue throughout the year. A more detailed exploration of which months or seasons contribute significantly to high or low average revenues can unveil insights that may guide release strategies, marketing efforts, or resource allocation.

trend_strength seasonal_strength_year seasonal_peak_year seasonal_trough_year spikiness linearity curvature stl_e_acf1 stl_e_acf10
0.5208086 0.5817723 5 8 8.034234e+27 1142997602 -136943061 -0.0645241 0.0550351

Film Characteristics and Insights

The genre of a film is a crucial aspect that defines its style, tone, and overall artistic expression. It serves as a blueprint, giving audiences a general idea of what to expect and helping filmmakers convey their vision effectively. The genre serves as a crucial component in the marketing and promotion of a film. It helps studios target specific demographics and tailor promotional campaigns to reach the intended audience.

As illustrated in the chart, Sci-Fi and Adventure emerge as the most lucrative genres within the film industry. This can be primarily attributed to the presence of many blockbuster titles within these specific genres. Despite the presence of outliers, which could represent exceptional cases or singular phenomena, the overarching trend reflected in the chart suggests a consistent and widespread favoritism towards Sci-Fi and Adventure genres. This pattern implies that audiences are consistently drawn to these genres, reinforcing their status as the forefront contributors to the film industry’s financial success.

Upon analyzing the revenue distributions across genres, a notable observation emerged: the box plots for the Fantasy and Family genres exhibited remarkable similarity. This observation prompted a deeper exploration into the correlations among various genre combinations.

A noteworthy finding was the high positive correlation between the Animation and Family genres. This correlation aligns seamlessly with the prevalent trend of animated family-oriented films. Conversely, an intriguing insight surfaced when examining the negative correlation between Thriller and Comedy genres. This distinctive relationship suggests an unconventional pairing that has not been extensively explored in the cinematic landscape.

This negative correlation sparks a thought-provoking notion — the potential for an innovative and revolutionary genre combination. The rarity of Thriller-Comedy hybrids in the current cinematic landscape presents an opportunity for filmmakers to experiment. This unexplored territory not only provides creative potential but also introduces the possibility of captivating a diverse audience with a novel cinematic experience.

Movies often navigate across various genres to broaden their appeal and cater to a diverse demographic. The trend indicates that a film’s performance tends to improve as it incorporates multiple genres, with the optimal balance appearing to be around six genres.

cast <- df %>% 
  dplyr::select(Title, Worldwide, Cast) %>% 
  tidyr::separate_rows(Cast, sep = ", ") %>% 
  dplyr::group_by(Cast) %>% 
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>% 
  dplyr::mutate(Movies_Acted = case_when(
    Count >= 5 & Count <= 10 ~ '5-10',
    Count > 10 & Count <= 15 ~ '10-15',
    Count > 15 & Count <= 20 ~ '15-20',
    Count > 20 ~ '20+',
    TRUE ~ 'Less than 5'
  )) %>% 
  dplyr::group_by(Movies_Acted) %>% 
  dplyr::summarise(AvgRev = mean(AvgRev),
                   Count = n()) %>% 
  dplyr::arrange(desc(AvgRev))
cast
## # A tibble: 5 × 3
##   Movies_Acted     AvgRev Count
##   <chr>             <dbl> <int>
## 1 10-15        340892604.    52
## 2 20+          325170485.     9
## 3 5-10         297994310.   291
## 4 15-20        291142359.    31
## 5 Less than 5  200239184.  2763
# Changing preference for newer faces or different types of story telling. Still like regulars 
range_order <- c("Less than 5", "5-10", "10-15", "15-20", "20+")
cast$Movies_Acted <- factor(cast$Movies_Acted, levels = range_order)
ggplot(cast, aes(x = Movies_Acted, y = AvgRev, fill = factor(Count))) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  scale_fill_viridis_d() +
  labs(title = "Average Revenue by Number of Movies Acted",
       x = "Movies_Acted",
       y = "Average Revenue",
       fill = "Count") +
  theme_minimal()

star <- df %>% 
  dplyr::select(Worldwide, Star) %>% 
  dplyr::group_by(Star) %>% 
  dplyr::summarise(AvgRev = mean(Worldwide),
                   Count = n()) %>% 
  dplyr::filter(Count >= 5) %>% 
  dplyr::arrange(desc(AvgRev))
star
## # A tibble: 93 × 3
##    Star                   AvgRev Count
##    <chr>                   <dbl> <int>
##  1 Robert Downey Jr. 1065872463.    11
##  2 Chris Pratt        942798923.     9
##  3 Tom Holland        894681963.     5
##  4 Daniel Radcliffe   873331103.     9
##  5 Elijah Wood        700348692.     5
##  6 Daniel Craig       584913297.     8
##  7 Jing Wu            579671044.     6
##  8 Tobey Maguire      548438356.     5
##  9 Kristen Stewart    542133565.     7
## 10 Chris Hemsworth    521315834      6
## # ℹ 83 more rows
star_plot <- star %>%  
  ggplot(aes(x = Count, y = AvgRev, size = Count, color = Count,
             text = paste("Star:", Star, "<br>Number of Movies:", Count, "<br>Average Revenue:", scales::dollar(AvgRev)))) +
  geom_point() +
  labs(title = "Movie Stars and Avg Revenue",
       x = "Number of Movies",
       y = "Average Revenue",
       size = "Number of Movies")
plotly::ggplotly(star_plot, tooltip = "text")
#RUN REGRESSION ANALYSIS

writer <- df %>% 
  dplyr::select(Title, Worldwide, Writer) %>% 
  stats::na.omit() %>% 
  dplyr::mutate(Writer_Count = str_count(Writer, ",") + 1) %>%
  dplyr::mutate(Grouped_Writer_Count = ifelse(Writer_Count >= 10, 10, Writer_Count)) %>%
  dplyr::group_by(Grouped_Writer_Count) %>% 
  dplyr::summarise(AvgRevenue = mean(Worldwide),
                   Count = n())

writer
## # A tibble: 10 × 3
##    Grouped_Writer_Count AvgRevenue Count
##                   <dbl>      <dbl> <int>
##  1                    1 192412071.   316
##  2                    2 227805328.   434
##  3                    3 210036030.   352
##  4                    4 282824821.   255
##  5                    5 301414854.   184
##  6                    6 307590921.   118
##  7                    7 369320972.    71
##  8                    8 390427176.    37
##  9                    9 363805457     39
## 10                   10 477376779.    52
# FIND AND GRAPH THE TOP DIRECTORS
# DIRECTOR GENRE GRAPH

director <- df %>% 
  dplyr::select(Title, Worldwide, Director) %>% 
  dplyr::mutate(Director_Count = str_count(Director, ",") + 1) %>% 
  dplyr::group_by(Director_Count) %>% 
  dplyr::summarise(AvgRevenue = median(Worldwide),
                   Count = n()) %>% 
  dplyr::filter(Count > 2)

director
## # A tibble: 4 × 3
##   Director_Count AvgRevenue Count
##            <dbl>      <dbl> <int>
## 1              1  162091208  1641
## 2              2  180513586   187
## 3              3  256786742    25
## 4              4  191439347     6

MACHINE LEARNING

library(dplyr)
library(parsnip)
library(ggplot2)
# Split data
set.seed(123)  # Set a seed for reproducibility
#df_shuffled <- df[sample(nrow(df)), ] # Shuffle data to elimate some bias hopefully/
df_split <- df %>% rsample::initial_split(prop = 0.80)
df_train <- rsample::training(df_split)
df_test  <- rsample::testing(df_split)

# Worldwide as dependent variable
recipe_pipeline <- recipes::recipe(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre`+`count_genres`, data = df_train) %>%
  # step_rm(date) %>%
  recipes::prep()

train_baked <- recipes::bake(recipe_pipeline, df_train)

recipe_pipeline <- recipes::recipe(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre`+`count_genres`, data = df_test) %>%
  # step_rm(date) %>%
  recipes::prep()

test_baked <- recipes::bake(recipe_pipeline, df_test)  # Corrected to use test_baked


common_levels_distributor <- intersect(levels(train_baked$Distributor), levels(test_baked$Distributor))
train_baked$Distributor <- factor(train_baked$Distributor, levels = common_levels_distributor)
test_baked$Distributor <- factor(test_baked$Distributor, levels = common_levels_distributor)



# Modeling
model <- parsnip::decision_tree(mode = "regression") %>%
  parsnip::set_engine("rpart") %>%
  parsnip::fit(Worldwide ~ Budget + Distributor + `Release Month` + MPAA + `Run Time (Mins)` + `First Genre`+`count_genres`, data = train_baked)
model
## parsnip model object
## 
## n= 1492 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1492 1.122213e+20  261062300  
##    2) Budget< 1.445e+08 1335 5.514783e+19  215972100  
##      4) Budget< 8.9e+07 1083 2.741099e+19  179967600 *
##      5) Budget>=8.9e+07 252 2.029945e+19  370705400  
##       10) First Genre=Biography,Comedy,Crime,Documentary,Drama,Fantasy,Horror 41 7.609659e+17  218163700 *
##       11) First Genre=Action,Adventure,Mystery 211 1.839907e+19  400346200  
##         22) Run Time (Mins)< 139 151 5.717508e+18  337193600 *
##         23) Run Time (Mins)>=139 60 1.056373e+19  559280300  
##           46) MPAA=R 14 3.150352e+17  270567100 *
##           47) MPAA=PG,PG-13 46 8.726558e+18  647149600  
##             94) Release Month=Aug,Jan,Jun,Sep 13 1.037251e+18  376371400 *
##             95) Release Month=Apr,Dec,Feb,Jul,Mar,May,Nov,Oct 33 6.360646e+18  753819800 *
##    3) Budget>=1.445e+08 157 3.127963e+19  644473100  
##      6) Budget< 2.185e+08 133 1.527063e+19  562482600  
##       12) Run Time (Mins)< 120.5 66 3.358576e+18  448968700 *
##       13) Run Time (Mins)>=120.5 67 1.022388e+19  674302200  
##         26) Distributor=Lions Gate Films,Paramount Pictures,Sony Pictures Entertainment (SPE),Twentieth Century Fox,Universal Pictures,Warner Bros. 55 5.861216e+18  594204700 *
##         27) Distributor=20th Century Studios,DreamWorks,Walt Disney Studios Motion Pictures 12 2.392533e+18 1041416000 *
##      7) Budget>=2.185e+08 24 1.016019e+19 1098837000  
##       14) Release Month=Jul,Jun,Mar,May,Nov,Oct 17 1.854293e+18  799961800 *
##       15) Release Month=Apr,Dec 7 3.099440e+18 1824678000 *
test_predictions <- predict(model, new_data = test_baked)
test_predictions
## # A tibble: 373 × 1
##         .pred
##         <dbl>
##  1 218163671.
##  2 179967611.
##  3 179967611.
##  4 179967611.
##  5 179967611.
##  6 179967611.
##  7 179967611.
##  8 179967611.
##  9 179967611.
## 10 179967611.
## # ℹ 363 more rows
# Plot tree
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.2
## Loading required package: rpart
## 
## Attaching package: 'rpart'
## The following object is masked from 'package:dials':
## 
##     prune
rpart.plot(
  model$fit,
  roundint = F,
  cex = 1,
  fallen.leaves = F,
  extra = "auto",
  main = "Regression Tree"
)

# Results
res <- model %>% predict(new_data = test_baked) %>%
  bind_cols(test_baked %>% dplyr::select(Worldwide))
res %>% yardstick::metrics(truth = Worldwide, estimate = .pred)
## # A tibble: 3 × 3
##   .metric .estimator     .estimate
##   <chr>   <chr>              <dbl>
## 1 rmse    standard   216512197.   
## 2 rsq     standard           0.283
## 3 mae     standard   132896811.
res %>% ggplot(aes(x = .pred, y = Worldwide)) + geom_point() +
  labs(title = "Prediction vs Actual",
       subtitle = "Decision Tree - Regression")

# Assuming 'model' is your decision tree model and 'df_test' is your testing data
library(parsnip)
library(yardstick)
library(vip)
## Warning: package 'vip' was built under R version 4.3.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
# Assess variable importance
importance <- model %>% vip()

# Assess variable relationships
plot(importance)

ui <- fluidPage(
  titlePanel("Movie Revenue Prediction"),
  sidebarLayout(
    sidebarPanel(
      # Input fields for user to enter movie details
      textInput("budget", "Budget", value = ""),
      textInput("distributor", "Distributor", value = ""),
      dateInput("release_date", "Release Date", value = ""),
      selectInput("mpaa_rating", "MPAA Rating",
                  choices = c("PG", "PG-13", "R"), selected = "PG-13"),
      numericInput("director_count", "# of Directors", value = 1),
      numericInput("writer_count", "# of Writers", value = 1),
      textInput("star", "Star", value = ""),
      numericInput("duration", "Duration (Mins)", value = 120),
      numericInput("genres_count", "# of Genres", value = 1),
      selectInput("primary_genre", "Primary Genre",
                  choices = c("Adventure", "Comedy", "Fantasy", "Animation", "Family", 
                              "Biography", "Drama", "History", "Action", "Sci-Fi", 
                              "Crime", "Mystery", "Thriller", "Musical", "Romance", 
                              "Horror", "Sport", "Documentary", "Music", "War", 
                              "Western", "Short"), 
                  selected = "Action"),
      # ...

      actionButton("predict_button", "Predict Revenue")
    ),
    mainPanel(
      # Display the predicted revenue as a numeric value
      textOutput("prediction_output")
    )
  )
)

server <- function(input, output) {
  predicted_revenue <- eventReactive(input$predict_button, {
    new_data <- tibble(
      Budget = as.numeric(input$budget),
      Distributor = input$distributor,
      `Earliest Release Date` = as.Date(input$release_date),
      MPAA = input$mpaa_rating,
      Director_Count = as.numeric(input$director_count),
      Writer_Count = as.numeric(input$writer_count),
      Star = input$star,
      `Run Time (Mins)` = as.numeric(input$duration),
      Genres_Count = as.numeric(input$genres_count),
      Primary_Genre = input$primary_genre,

    )

    # Bake the new data using the updated recipe
    new_data_baked <- recipes::bake(recipe_pipeline2, new_data)

    # Make predictions using the trained model
    predictions <- predict(lm_fit, new_data_baked)

    formatted_output <- sprintf("Predicted Revenue: %.2f", as.numeric(predictions))
    
    formatted_output
  })

  # Display the predicted revenue as a numeric value
  output$prediction_output <- renderText({
    predicted_revenue()
  })
}

# Run the Shiny app
shinyApp(ui, server)
Shiny applications not supported in static R Markdown documents